Airbnb is a prime example of a disruptive innovation, that is now one of the largest marketplaces for accomodation with over 7 million properties in more than 220 countries. With this project we sought to utilize scraped data from Airbnb listings to carry out statistical analyses and ultimately predict the total cost for two people staying four nights in the city of Munich, Germany. After initial cleaning and wrangling of the dataset, an exploratory data analysis (EDA) was carried out to investigate existing relationships between variables, especially within and between price, neighbourhood / region, room and property type, as well as reviews and cancellation policy. As will be explained in greater detail below, we grouped the neighbourhoods within zones based on both personal experience and an official map of zones based on accomodation quality and price from the city of Munich. Key observations within our EDA were that there is a heavily right skewed distribution of price and reviews, and that no linear relationship regarding price could be observed; this led us to use the log of the total price for 4 days going forward with our regression. We progressively improved our model of regression by investigating the effect of all variables as displayed through t- and p-values. Our final and best model includes the most extensive list of variables, including for example the addition of logical variables for the only two significant amenities (elevator and shampoo). We ultimately arrived at an adjusted R-squared value of around 40%. Given that the correlation matrix and other early analyses showed rather weak / limited relationships between variables, we believe this is a good result based on the given dataset. Lastly, plots of residuals (i.e. QQ-plot, residuals vs. fitted) as well as variation inflation factor analyses showed that all assumptions of a linear regression (L-I-N-E) were met. Source

(Chunk 1: Introduction of project)

Importing original data and libraries we need:

library(tidyverse)  # Load ggplot2, dplyr, and all the other tidyverse packages
library(mosaic)
library(ggthemes)
library(GGally)
library(readxl)
library(here)
library(skimr)
library(janitor)
library(broom)
library(infer)
library(vroom)
library(stringr)
library(leaflet)
library(broom)
library(huxtable)
library(ggfortify)
library(rsample)
listings <- vroom("http://data.insideairbnb.com/germany/bv/munich/2020-06-20/data/listings.csv.gz") %>% 
    clean_names()
#glimpse(listings) # checking variable headers

1 Data Preprocessing

1.1 Selecting variables and changing to the relevant type

We first select all potentially relevant variables from our data frame. The data is cleaned into number or factor to begin Exploratory Data Analysis (EDA). The raw dataset we create here is called “munich_listings”.

#Selecting all the relevant variables
munich_listings<- listings %>% 
  select(id, 
         host_is_superhost,
         host_listings_count,
         neighbourhood_cleansed,
         latitude,
         longitude,
         property_type,
         room_type,
         accommodates,
         bathrooms,
         bedrooms,
         beds,
         bed_type,
          #square_feet, we noticed that a lot of values are missing so excluded this variable
         price,
         security_deposit,
         cleaning_fee,
         guests_included,
         extra_people,
         minimum_nights,
         maximum_nights,
         number_of_reviews,
         reviews_per_month,
         review_scores_rating,
         review_scores_accuracy,
         review_scores_cleanliness,
         review_scores_checkin,
         review_scores_communication,
         review_scores_location,
         review_scores_value,
         is_location_exact,
         amenities,
         instant_bookable,
         cancellation_policy,
         availability_365,
         availability_90,
         last_review,
         listing_url,
         last_scraped) %>% 
#Converting characters to "doubles" and factors where appropriate
  mutate(neighbourhood_cleansed=factor(neighbourhood_cleansed),
         property_type,
         room_type=factor(room_type),
         price=parse_number(price),
         security_deposit=parse_number(security_deposit),
         cleaning_fee=parse_number(cleaning_fee),
         extra_people=parse_number(extra_people),
         cancellation_policy=factor(cancellation_policy),
         bed_type=factor(bed_type),
         amenities_count= str_count(listings$amenities, ","))



#Inspecting data frame to make sure all the variables are correctly attributed
glimpse(munich_listings) 
## Rows: 11,172
## Columns: 39
## $ id                          <dbl> 36720, 97945, 114695, 127383, 157808, 1...
## $ host_is_superhost           <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, ...
## $ host_listings_count         <dbl> 1, 1, 3, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, ...
## $ neighbourhood_cleansed      <fct> Ludwigsvorstadt-Isarvorstadt, Hadern, B...
## $ latitude                    <dbl> 48.1, 48.1, 48.1, 48.2, 48.2, 48.1, 48....
## $ longitude                   <dbl> 11.6, 11.5, 11.6, 11.6, 11.6, 11.5, 11....
## $ property_type               <chr> "Apartment", "Apartment", "Apartment", ...
## $ room_type                   <fct> Entire home/apt, Entire home/apt, Entir...
## $ accommodates                <dbl> 2, 2, 5, 4, 2, 3, 4, 2, 2, 2, 2, 1, 16,...
## $ bathrooms                   <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
## $ bedrooms                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ beds                        <dbl> 1, 1, 3, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0, ...
## $ bed_type                    <fct> Futon, Real Bed, Real Bed, Real Bed, Re...
## $ price                       <dbl> 95, 80, 95, 120, 35, 55, 55, 65, 54, 67...
## $ security_deposit            <dbl> 100, NA, 500, NA, 100, 0, 200, NA, 190,...
## $ cleaning_fee                <dbl> 30, 10, 60, 28, 10, 60, 20, NA, 32, NA,...
## $ guests_included             <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, ...
## $ extra_people                <dbl> 30, 10, 50, 0, 15, 30, 15, 0, 0, 0, 0, ...
## $ minimum_nights              <dbl> 2, 2, 2, 2, 1, 3, 2, 3, 1, 2, 3, 2, 1, ...
## $ maximum_nights              <dbl> 730, 90, 30, 14, 36, 90, 1125, 14, 4, 3...
## $ number_of_reviews           <dbl> 25, 131, 53, 84, 0, 33, 467, 64, 211, 8...
## $ reviews_per_month           <dbl> 0.34, 1.23, 0.49, 0.76, NA, 0.31, 4.39,...
## $ review_scores_rating        <dbl> 98, 97, 95, 98, NA, 93, 99, 91, 97, 97,...
## $ review_scores_accuracy      <dbl> 10, 10, 9, 10, NA, 9, 10, 9, 10, 10, 10...
## $ review_scores_cleanliness   <dbl> 10, 10, 10, 10, NA, 9, 10, 9, 10, 10, 9...
## $ review_scores_checkin       <dbl> 10, 10, 10, 10, NA, 9, 10, 10, 10, 10, ...
## $ review_scores_communication <dbl> 10, 10, 10, 10, NA, 10, 10, 10, 10, 10,...
## $ review_scores_location      <dbl> 10, 9, 9, 10, NA, 9, 10, 9, 10, 10, 10,...
## $ review_scores_value         <dbl> 9, 9, 9, 10, NA, 9, 10, 9, 9, 10, 9, 10...
## $ is_location_exact           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
## $ amenities                   <chr> "{TV,\"Cable TV\",Internet,Wifi,Kitchen...
## $ instant_bookable            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE...
## $ cancellation_policy         <fct> strict_14_with_grace_period, flexible, ...
## $ availability_365            <dbl> 0, 82, 59, 6, 0, 142, 260, 90, 0, 111, ...
## $ availability_90             <dbl> 0, 2, 48, 6, 0, 4, 46, 90, 0, 43, 0, 89...
## $ last_review                 <date> 2017-07-22, 2019-10-03, 2019-10-06, 20...
## $ listing_url                 <chr> "https://www.airbnb.com/rooms/36720", "...
## $ last_scraped                <date> 2020-06-21, 2020-06-20, 2020-06-21, 20...
## $ amenities_count             <int> 10, 35, 36, 37, 24, 37, 32, 19, 31, 22,...

In “munich_listing”, we have 11172 items and 46 columns. Here are some noticeable changes we made: 1. neighbourhood_cleansed, room_type, cancellation_policy and bed_type are changed into factors. 1. price, security_deposit, cleaning_fee, extra_people and amenities_count are changed into numbers.

So now we have: host_is_superhost, is_location_exact, instant_bookable as logical variable neighbourhood_cleansed, room_type, bed_type, cancellation_policy as factor variable amenitiesm, property_type as character variable

1.2 Data cleaning

We now create a new data frame called “munich_listings_cleaned” to do some required changes. Here, we deal with missing values/NAs, and clean the data for property type. Also, we filter the items upon min/max nights and accommodates for the 2 people to live for 4 nights.

#Filter dataset for two people and 4 nights
#Clean dataset for cleaning_fee, security_deposit, property_type, minimum_nights and accommodates
munich_listings_cleaned <- munich_listings %>%
  mutate(cleaning_fee = case_when(      #considering cleaning_fee as 0 if displayed as NA
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee),
    security_deposit = case_when(      #considering security_deposit as 0 if displayed as NA
    is.na(security_deposit) ~ 0, 
    TRUE ~ security_deposit),
    prop_type_simplified = case_when(   #regrouping of property_types: put all less popular property types into "Other"
    property_type %in% c("Apartment",
                         "House",
                         "Condominium",
                         "Loft")~ property_type , 
    TRUE ~ "Other"),
    prop_type_simplified=factor(prop_type_simplified)) %>% #creating factors
  filter(minimum_nights<=4, 
         maximum_nights>=4, 
         accommodates>=2) #filtering dataframe for 2 people and 4 nights

#Visually inspecting cleaned data set
#glimpse(munich_listings_cleaned)
#skim(munich_listings_cleaned)

For the NAs: We assume NA as 0 in cleaning fee and security deposit, which means we can book Airbnb without paying for these 2 services. So we didn’t make deletion here.

For property_type: We arranged the data set and find the top 5 kinds of Airbnbs in Munich, which are Apartment, House, Condominium, Loft and others. We transferred the variable into factors.

Filtering: We filter the room with minimum_night and maximum_night so that they can be booked for a 4-night stay. Also, the room should accommodate at least 2 people.

1.3 Calculating total price

Then, we construct the formula for the total price of 4 days into data frame munich_listings_total_price: We create total_price_4_days as our target variable for regression representing total price of 4-night stay of two people. Note that the if_else statement allows us to include the option of adding 1 extra guest to an AirBnB that has accommodates = 1. The final multiplier of 1.142 is the 14.2% (service fee)[https://www.airbnb.co.uk/help/article/1857/what-are-airbnb-service-fees#:~:text=Host%20service%20fee&text=This%20fee%20is%20calculated%20from,deducted%20from%20the%20host%20payout] for AirBnB bookings that the company charges per booking.

munich_listings_total_price<-munich_listings_cleaned %>% 
  mutate(total_price_4_days=price*4+ #calculating the total price for 4 days 2 guests
           cleaning_fee+
           if_else(guests_included==1, 
                   extra_people*4,0))

1.4 Creating a new data frame for further analysis

We now create a new data frame called “munich_listings_region” grouping the Airbnbs geographically and making some changes for the subsequent analysis.

Three variable classes are created: region: grouped into 5 by the average price of each neighborhood rating_group: grouped into 3 by whether the rating is over 90 Amenities: Many different amenity words were checked for significance, only two remained. Interestingly theyre shampoo and elevator.

munich_listings_region <- munich_listings_total_price %>% 
  mutate(
region = case_when( #creating variable that clusters neighbourhoods for further analysis
      neighbourhood_cleansed=="Altstadt-Lehel"~"zone_1",
      neighbourhood_cleansed=="Ludwigsvorstadt-Isarvorstadt"~"zone_1",
      neighbourhood_cleansed=="Maxvorstadt"~"zone_1",
      neighbourhood_cleansed=="Schwabing-West"~"zone_2",
      neighbourhood_cleansed=="Au-Haidhausen"~"zone_2",
      neighbourhood_cleansed=="Sendling"~"zone_2",
      neighbourhood_cleansed=="Sendling-Westpark"~"zone_2",
      neighbourhood_cleansed=="Schwanthalerhöhe"~"zone_1",
      neighbourhood_cleansed=="Neuhausen-Nymphenburg"~"zone_3",
      neighbourhood_cleansed=="Moosach"~"zone_5",
      neighbourhood_cleansed=="Milbertshofen-Am Hart"~"zone_5",
      neighbourhood_cleansed=="Schwabing-Freimann"~"zone_3",
      neighbourhood_cleansed=="Bogenhausen"~"zone_4",
      neighbourhood_cleansed=="Berg am Laim"~"zone_4",
      neighbourhood_cleansed=="Tudering-Riem"~"zone_1",
      neighbourhood_cleansed=="Ramersdorf-Perlach"~"zone_5",
      neighbourhood_cleansed=="Obergiesing"~"zone_2",
      neighbourhood_cleansed=="Untergiesing-Harlaching"~"zone_4",
      neighbourhood_cleansed=="Thalkirchen-Obersendling-Forstenried-Fürstenried-Solln"~"zone_3",
      neighbourhood_cleansed=="Hadern"~"zone_5",
      neighbourhood_cleansed=="Pasing-Obermenzing"~"zone_3",
      neighbourhood_cleansed=="Aubing-Lochhausen-Langwied"~"zone_4",
      neighbourhood_cleansed=="Allach-Untermenzing"~"zone_3",
      neighbourhood_cleansed=="Feldmoching-Hasenbergl"~"zone_3",
      neighbourhood_cleansed=="Laim"~"zone_5"
      ),
rating_group= case_when( #clustering review_scores_rating to 2 groups
  review_scores_rating <90 ~ "Under 90",
  TRUE ~ "Over 90"),
# is_pool=case_when(
#   grepl("Pool", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_gym=case_when(
#   grepl("Gym", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_private_entrance=case_when(
#   grepl("Private entrance", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_balcony=case_when(
#   grepl("balcony", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_kitchen=case_when(
#   grepl("Kitchen", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
is_elevator=case_when( # turned out to be significant
  grepl("Elevator", 
        amenities, 
        fixed=TRUE) ~ TRUE,
  TRUE ~FALSE),
# is_washer=case_when(
#   grepl("Washer", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_dryer=case_when(
#   grepl("Dryer", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_free_parking=case_when(
#   grepl("Free parking on premises", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_paid_parking=case_when(
#   grepl("Paid parking off premises", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_essentials=case_when(
#   grepl("Essentials", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
is_shampoo=case_when( #turned out to be significant
  grepl("Shampoo", 
        amenities, 
        fixed=TRUE) ~ TRUE,
  TRUE ~FALSE))
# is_host_greets_you=case_when(
#   grepl("Host greets you", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE),
# is_garden=case_when(
#   grepl("Garden or backyard", amenities, fixed=TRUE) ~ TRUE,
#   TRUE ~FALSE))


munich_listings_region <- munich_listings_region %>%  #cleaning dataframe from all the missing values
 na.omit()

1.5 Key variable descriptions

Here are description of the key variables in our dataset:

dependent variable:

total_price_4_days

independent variable:

property_type: type of accommodation (House, Apartment, etc.)

room_type: - Entire home/apt (guests have entire place to themselves)

  • Private room (Guests have private room to sleep, all other rooms shared)

  • Shared room (Guests sleep in room shared with others)

number_of_reviews: Total number of reviews for the listing

review_scores_rating: Average review score (0 - 100)

longitude , latitude: geographical coordinates to help us locate the listing

region: factor. Region the Airbnb is at grouping by house price. factored 1-5 from high price to low price

prop_type_simplified: type of accommodation (House, Apartment, Loft, Condominium)

room_type:Entire home/apt, Private room, Shared room

number_of_reviews: Total number of reviews for the listing

reviews_per_month: Number of reviews per month

review_scores_: Rating for in reviews in different aspects

rating_group: Average review score (0 - 100) grouped by 90

longitude , latitude: geographical coordinates to help us locate the listing

region: factor. Region the Airbnb is at grouping by house price. factored 1-5 from high price to low price

availability_365: Available days in the last 365 days

is_elevator and is_shampoo: Whether there is elevator or shampoo facilitated

2 Exploratory Data Analysis

Now that we have cleaned our data sets for our specific target (4 nights, 2 people) we will conduct a exploratory data analysis.

##Summary statistics and favstats

#summary to check for NA's and general statistics
summary(munich_listings_region)
##        id           host_is_superhost host_listings_count
##  Min.   :   36720   Mode :logical     Min.   :  0        
##  1st Qu.:12333597   FALSE:5045        1st Qu.:  1        
##  Median :22583936   TRUE :980         Median :  1        
##  Mean   :22560646                     Mean   :  2        
##  3rd Qu.:33438176                     3rd Qu.:  2        
##  Max.   :43697251                     Max.   :652        
##                                                          
##                   neighbourhood_cleansed    latitude      longitude   
##  Ludwigsvorstadt-Isarvorstadt: 717       Min.   :48.1   Min.   :11.4  
##  Maxvorstadt                 : 666       1st Qu.:48.1   1st Qu.:11.5  
##  Schwabing-West              : 446       Median :48.1   Median :11.6  
##  Neuhausen-Nymphenburg       : 434       Mean   :48.1   Mean   :11.6  
##  Au-Haidhausen               : 408       3rd Qu.:48.2   3rd Qu.:11.6  
##  Schwabing-Freimann          : 339       Max.   :48.2   Max.   :11.7  
##  (Other)                     :3015                                    
##  property_type                room_type     accommodates     bathrooms  
##  Length:6025        Entire home/apt:3544   Min.   : 2.00   Min.   :0.0  
##  Class :character   Hotel room     :  34   1st Qu.: 2.00   1st Qu.:1.0  
##  Mode  :character   Private room   :2374   Median : 2.00   Median :1.0  
##                     Shared room    :  73   Mean   : 2.92   Mean   :1.1  
##                                            3rd Qu.: 4.00   3rd Qu.:1.0  
##                                            Max.   :16.00   Max.   :4.5  
##                                                                         
##     bedrooms          beds                bed_type        price     
##  Min.   : 0.00   Min.   : 0.00   Airbed       :  10   Min.   :  11  
##  1st Qu.: 1.00   1st Qu.: 1.00   Couch        :  30   1st Qu.:  56  
##  Median : 1.00   Median : 1.00   Futon        :  30   Median :  83  
##  Mean   : 1.11   Mean   : 1.61   Pull-out Sofa: 162   Mean   : 116  
##  3rd Qu.: 1.00   3rd Qu.: 2.00   Real Bed     :5793   3rd Qu.: 128  
##  Max.   :20.00   Max.   :30.00                        Max.   :9000  
##                                                                     
##  security_deposit  cleaning_fee   guests_included  extra_people  
##  Min.   :   0     Min.   :  0.0   Min.   :  1.0   Min.   :  0.0  
##  1st Qu.:   0     1st Qu.:  0.0   1st Qu.:  1.0   1st Qu.:  0.0  
##  Median :   0     Median : 19.0   Median :  1.0   Median : 10.0  
##  Mean   : 169     Mean   : 22.4   Mean   :  1.5   Mean   : 15.4  
##  3rd Qu.: 200     3rd Qu.: 35.0   3rd Qu.:  2.0   3rd Qu.: 25.0  
##  Max.   :4500     Max.   :300.0   Max.   :200.0   Max.   :250.0  
##                                                                  
##  minimum_nights maximum_nights   number_of_reviews reviews_per_month
##  Min.   :1.00   Min.   :     4   Min.   :  1       Min.   : 0.01    
##  1st Qu.:1.00   1st Qu.:    17   1st Qu.:  2       1st Qu.: 0.12    
##  Median :2.00   Median :  1125   Median :  6       Median : 0.32    
##  Mean   :1.88   Mean   :   653   Mean   : 22       Mean   : 0.69    
##  3rd Qu.:2.00   3rd Qu.:  1125   3rd Qu.: 18       3rd Qu.: 0.77    
##  Max.   :4.00   Max.   :111360   Max.   :688       Max.   :16.36    
##                                                                     
##  review_scores_rating review_scores_accuracy review_scores_cleanliness
##  Min.   : 20.0        Min.   : 2.00          Min.   : 2.00            
##  1st Qu.: 93.0        1st Qu.:10.00          1st Qu.: 9.00            
##  Median : 97.0        Median :10.00          Median :10.00            
##  Mean   : 94.6        Mean   : 9.67          Mean   : 9.43            
##  3rd Qu.:100.0        3rd Qu.:10.00          3rd Qu.:10.00            
##  Max.   :100.0        Max.   :10.00          Max.   :10.00            
##                                                                       
##  review_scores_checkin review_scores_communication review_scores_location
##  Min.   : 2.00         Min.   : 2.0                Min.   : 2.00         
##  1st Qu.:10.00         1st Qu.:10.0                1st Qu.: 9.00         
##  Median :10.00         Median :10.0                Median :10.00         
##  Mean   : 9.75         Mean   : 9.8                Mean   : 9.63         
##  3rd Qu.:10.00         3rd Qu.:10.0                3rd Qu.:10.00         
##  Max.   :10.00         Max.   :10.0                Max.   :10.00         
##                                                                          
##  review_scores_value is_location_exact  amenities         instant_bookable
##  Min.   : 2.00       Mode :logical     Length:6025        Mode :logical   
##  1st Qu.: 9.00       FALSE:1177        Class :character   FALSE:4096      
##  Median : 9.00       TRUE :4848        Mode  :character   TRUE :1929      
##  Mean   : 9.25                                                            
##  3rd Qu.:10.00                                                            
##  Max.   :10.00                                                            
##                                                                           
##                   cancellation_policy availability_365 availability_90
##  flexible                   :2029     Min.   :  0      Min.   : 0.0   
##  moderate                   :2081     1st Qu.:  0      1st Qu.: 0.0   
##  strict                     :   0     Median :  0      Median : 0.0   
##  strict_14_with_grace_period:1913     Mean   : 70      Mean   :23.9   
##  super_strict_30            :   1     3rd Qu.: 90      3rd Qu.:55.0   
##  super_strict_60            :   1     Max.   :365      Max.   :90.0   
##                                                                       
##   last_review         listing_url         last_scraped        amenities_count
##  Min.   :2012-10-04   Length:6025        Min.   :2020-06-20   Min.   : 0.0   
##  1st Qu.:2018-09-30   Class :character   1st Qu.:2020-06-20   1st Qu.:12.0   
##  Median :2019-10-04   Mode  :character   Median :2020-06-20   Median :17.0   
##  Mean   :2019-03-01                      Mean   :2020-06-20   Mean   :18.6   
##  3rd Qu.:2020-01-06                      3rd Qu.:2020-06-21   3rd Qu.:25.0   
##  Max.   :2020-06-21                      Max.   :2020-06-21   Max.   :82.0   
##                                                                              
##   prop_type_simplified total_price_4_days    region          rating_group      
##  Apartment  :5276      Min.   :   44      Length:6025        Length:6025       
##  Condominium: 178      1st Qu.:  279      Class :character   Class :character  
##  House      : 215      Median :  392      Mode  :character   Mode  :character  
##  Loft       :  91      Mean   :  513                                           
##  Other      : 265      3rd Qu.:  568                                           
##                        Max.   :36050                                           
##                                                                                
##  is_elevator     is_shampoo     
##  Mode :logical   Mode :logical  
##  FALSE:3349      FALSE:2471     
##  TRUE :2676      TRUE :3554     
##                                 
##                                 
##                                 
## 
#running favstats on some interesting variable combinations and keeping the most interesting ones
favstats(price~accommodates, data=munich_listings_region) 
accommodatesminQ1medianQ3maxmeansdnmissing
21550  70100       999       84.9     65.5     35030
31262  90130       1e+03       109       79       8500
41179  110180       8e+03       153       292       11230
53593.8144200       1.12e+03182       147       1960
63995.2180300       1e+03       231       191       1980
73489  135190       700       176       129       430
825129  249450       995       324       254       610
965125  226288       950       311       325       60
1025196  294612       1.45e+03437       375       180
11149262  4752.74e+039e+03       2.52e+034.32e+0340
12125280  350585       800       421       231       90
133939  3939       39       39              10
14185242  300360       420       302       118       30
163535  35111       839       145       250       100
favstats(price~neighbourhood_cleansed, data=munich_listings_region)
neighbourhood_cleansedminQ1medianQ3maxmeansdnmissing
Allach-Untermenzing1842  75  110  530       111  113  350
Altstadt-Lehel2580  120  180  800       153  111  2220
Au-Haidhausen2560  85  120  1.45e+03115  116  4080
Aubing-Lochhausen-Langwied1641.565  149  380       99  81.3560
Berg am Laim2555  76.5131  400       103  75.71100
Bogenhausen2357  80  120  500       97.966.42970
Feldmoching-Hasenbergl2545  62.598.2350       88.468  740
Hadern1545  79  100  350       84.458.4730
Laim2050  80  121  585       100  82.82160
Ludwigsvorstadt-Isarvorstadt2870  100  150  9e+03       172  499  7170
Maxvorstadt2865  90  140  999       122  107  6660
Milbertshofen-Am Hart1249  70  100  400       86  58.62490
Moosach2550  70  100  800       101  113  1030
Neuhausen-Nymphenburg2152.279.5120  899       104  83.74340
Obergiesing1550  80  130  700       109  98.92130
Pasing-Obermenzing2146  70  125  800       105  104  1190
Ramersdorf-Perlach1545  60  90  420       75  49.42150
Schwabing-Freimann2055  80  120  1e+03       106  101  3390
Schwabing-West1156.280  120  1e+03       107  89.64460
Schwanthalerhöhe             NaN    00
Sendling2559.290  135  590       113  88.52580
Sendling-Westpark2052  80  120  990       109  105  2080
Thalkirchen-Obersendling-Forstenried-Fürstenried-Solln2550  75  120  1.12e+0398.893.52110
Tudering-Riem3050  75  120  999       127  152  1610
Untergiesing-Harlaching2860  80  120  500       107  77.71950
favstats(price~host_is_superhost, data=munich_listings_region)
host_is_superhostminQ1medianQ3maxmeansdnmissing
FALSE1159851309e+03119  210  50450
TRUE18507511089099.486.99800
favstats(price~prop_type_simplified, data=munich_listings_region)
prop_type_simplifiedminQ1medianQ3maxmeansdnmissing
Apartment115885  1258e+03112  15952760
Condominium195589.5154995140  1501780
House204565  10089096.81082150
Loft357598  1409e+03232  938910
Other205388  130999145  1882650
favstats(price~minimum_nights, data=munich_listings_region)

minimum_nightsminQ1medianQ3maxmeansdnmissing
11552801208e+03       113200  22570
21159851259e+03       114219  25290
31560891441.45e+03126130  9350
4236090140800       11482.73040
## Correlation Matrix

From the summary and favstats investigations, we have decided to conduct further exploratory data analysis through ggplot2. We will first build a correlation martix to spot the relationships between the particular variables.

munich_listing_is_numeric<-munich_listings_region[,sapply(munich_listings_region,is.numeric),with=FALSE]%>%
  na.omit() #We have created a dataframe that contains only numerical variables from our original dataframe in order to build the Correlation Matrix.

corMatrix <- as.data.frame(cor(munich_listing_is_numeric))
corMatrix$var1 <- rownames(corMatrix)
corMatrix2 <- corMatrix %>%
  gather(key = var2, value = r, 1:28) # selecting coloumns from dataframe
ggplot(corMatrix2,aes(x = var1, y = var2, fill = r)) +
  geom_tile() +
  geom_text(aes(label = round(r, 2)), size = 6) +
  scale_fill_gradient2(low = "#ff585d", #adding colour to matrix
                       high = "#00bf6f", 
                       mid = "white") +
  labs(title = "Correlation Matrix") +
  theme(axis.text.x = element_text(angle = 90, 
                                   hjust = 1,
                                   size=16),
        axis.text.y=element_text(size=16))

## Further analysis for collinear variables

munich_listing_is_numeric[,18:24]%>% #We tried to spot the correlation between the review-related variables using ggpairs plot
  ggpairs()

munich_listing_is_numeric%>% #We used the ggpairs plot to further analyse the bottom left part of our correlation matrix
  select(accommodates,bathrooms,
         bedrooms,
         beds,
         cleaning_fee,
         extra_people,
         guests_included,
         total_price_4_days,
         security_deposit)%>%
  ggpairs()

The correlation matrix above displays two key ‘green zones’ where there are moderate to strong correlations present between variables. In the upper right corner, the plot illustrates the positive correlations between the various review score components, indicating that when an Airbnb scores well on one criterium it will tend to also have a higher rating on the other criteria. The strongest correlatio here is between the total review score and the review score for accuracy, at a level of 0.74. In the lower left corner we can see positive correlations between variables ranging from weak to strong. As one would expect, the number of people an Airbnb in Munich accomodates has a strong positive correlation with the number of beds and the number of bedrooms. There is a moderatore positive correlation between the total accomodated and the cleaning fee. Lastly, there is a moderate positive correlation between the cleaning fee and the security deposit, likely attributable to the fact that these properties are of a higher standard, as is mentioned on Airbnb’s website (deposits are usually based on a home’s features).

Looking at our independent variable of interest for this project, the total price for a 4-day stay for two people, we only find weak positive correlations when disregarding the obvious connection to daily price. With a level of 0.29 there is a weak to moderate positive correlation between the total price and the number of people an Airbnb can accommodate; this is further supported by weak correlations (0.22) between total price and the number of bedrooms and beds. We will now continue to investigate relationships between our variables, in particular categorical variables not included in the above matrix.

2.1 Informative visualisations

ggplot(listings,aes(x=number_of_reviews))+
  geom_histogram(binwidth = 4)+
    xlim(0,250)+
    ylim(0,1000)+
     labs(title="Most of the airbnb accomodations have up to 20 reviews", 
          x="Number of Reviews", 
          y="Quantity")+
  theme_bw()

ggplot(munich_listing_is_numeric,
       aes(x=extra_people,y=total_price_4_days))+
  geom_point()+
  geom_smooth(method="lm")+
    ylim(0,3000)+
    xlim(0,100)+
      labs(title="Higher the Extra People Charge, the Higher the Overall Price", 
           x="Price per extra person", 
           y="Total price for 4 nights")+
  theme_bw()

#heavily right-skewed
ggplot(munich_listings_region, aes(x=total_price_4_days))+
  geom_density(bins=20)+
  xlim(0,4000) +
  labs(title="The density plot of total price for 4 nights is heavily right-skwed", 
       x="Density",  
       y="Total price for 4 nights")+
  theme_bw()

#log also heavily right skewed
ggplot(munich_listings_total_price, aes(x=total_price_4_days))+
  geom_density(bins=20)+
  scale_x_log10()+
  xlim(0,2500) +
  labs(title="Logarithmic Total Price Shows Nature of Price Clusters", 
       x="Density",  
       y="Total price for 4 nights")+
  theme_bw()

#histogram to show frequency
ggplot(munich_listings_total_price, aes(x=total_price_4_days))+
  geom_histogram(bins=100)+
  xlim(0,2500)+
  labs(title="Most Airbnbs cost around €300 for 4 Nights", 
       x="Total price for 4 nights", 
       y= "Quantity")+
  theme_bw()

munich_listings_region %>%
  group_by(room_type) %>%
  summarize(mean_price_roomtype = mean(total_price_4_days)) %>%
  arrange(desc(mean_price_roomtype)) %>%
  ggplot(aes(y=reorder(room_type, mean_price_roomtype), x = mean_price_roomtype)) + 
    geom_col() +
      labs(title="What are The Most Expensive Airbnb's in Munich?",
          subtitle="Hotels! who would have thought?", 
           x="Average price for 4 nights per room",  
           y="Room type")+
  theme_bw()

#Calculated mean price for 4 nights per room type

munich_listings_region %>%
  group_by(neighbourhood_cleansed) %>%
  summarize(mean_price_neighbourhood = mean(total_price_4_days)) %>%
  arrange(desc(mean_price_neighbourhood)) %>%
  ggplot(aes(y=reorder(neighbourhood_cleansed, mean_price_neighbourhood), x=mean_price_neighbourhood)) +
    geom_col()+
     labs(title="Average price for 4 nights per in particular neighbourhoods", 
          x="Average price for per room",  
          y="Neighbourhood")+
  theme_bw()

#Calculated mean price for 4 nights per neighbourhood
munich_listings_region %>%
  group_by(prop_type_simplified) %>%
  summarize(mean_price_property = mean(total_price_4_days)) %>%
  arrange(desc(mean_price_property)) %>%
  ggplot(aes(y=reorder(prop_type_simplified, mean_price_property), x = mean_price_property)) + 
    geom_col() +
      labs(title="Lofts Come at a Premium in Munich, Houses Present a Good Value Proposition", 
           x="Average price for per room",  
           y="Property type")+
  theme_bw()

#Calculated mean price for 4 nights per property type
munich_listings_region %>%
  group_by(prop_type_simplified) %>%
  mutate(count_property=count("Apartment")) %>%
  arrange((count_property)) %>%
  ggplot(aes(x=reorder(prop_type_simplified, desc(count_property)), y = count_property)) + 
    geom_col() +
      labs(title="Apartments Dominate Airbnb's Listings", 
           x="Property type",  
           y="Quantity")+
  theme_bw()

#Calculated count of particular property types

munich_listings_region %>%
 group_by(cancellation_policy) %>%
  ggplot(aes(x=reorder(cancellation_policy,total_price_4_days ), y = total_price_4_days)) + 
    geom_boxplot() +
      labs(title="Average prices per 4 nights for an Airbnb according to particular cancellation policies", 
           y="Price",  
           x="Cancellation policy")+
  scale_y_log10(limits=c(100,10000))+
  theme_bw()

#Calculated average price for particular cancellation policies

2.2 Mapping

Now, we will conduct the mapping of our locations on the Munich map. We decided to colour our data in regards to a particular zone they are located in, to have a better sense of the density of the accommodation in these zones. The zones were grouped by highest mean rental price, since it created the largest significance in our models later on.

pallette <- colorFactor(c("red", "blue", "green", "yellow","purple"), domain = c("zone_1", "zone_2", "zone_3", "zone_4","zone_5"))

    leaflet(data = munich_listings_region) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                 radius = 2,
                 color = ~pallette(region),
                   fillColor = ~region,
                   group = ~ region,
                   clusterId=~region,
                   fillOpacity = 0.4,
                   popup = ~listing_url, 
                   label = ~paste( prop_type_simplified, "Min nights", "=", minimum_nights))

3 Regression

Now we will start building our models. We will start from models with only a few variables and we will gradually try to build the model with the best fitting data and the biggest possible adjusted R-squared value. Running each model, we will as well check the colinearity analysis to cut confounding variables. For that reason we will use `car::vif(model_x)`` to calculate the Variance Inflation Factor (VIF) for our predictors. A general guideline is that a VIF larger than 5 or 10 is large, and our model may suffer from colinearity. We will remove the variable in question and run our model again without it if such a VIF occurs.

For our models we will use the log value of total_prices_4_days since the distribution of it is more bell shaped than the regular value and thus will be better descried by our model.

3.1 Model 1

We will fit our first regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

#Regression using log because normally distributed.
model1 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              review_scores_rating, 
            data=munich_listings_region)
msummary(model1)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.081471   0.084086   72.32  < 2e-16 ***
## prop_type_simplifiedCondominium  0.146014   0.044434    3.29   0.0010 ** 
## prop_type_simplifiedHouse       -0.111465   0.040595   -2.75   0.0061 ** 
## prop_type_simplifiedLoft         0.286691   0.061669    4.65  3.4e-06 ***
## prop_type_simplifiedOther        0.109228   0.036817    2.97   0.0030 ** 
## number_of_reviews               -0.001290   0.000157   -8.23  2.3e-16 ***
## review_scores_rating            -0.000471   0.000883   -0.53   0.5942    
## 
## Residual standard error: 0.583 on 6018 degrees of freedom
## Multiple R-squared:  0.0181, Adjusted R-squared:  0.0172 
## F-statistic: 18.5 on 6 and 6018 DF,  p-value: <2e-16
car::vif(model1)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.01  4               1
## number_of_reviews    1.01  1               1
## review_scores_rating 1.01  1               1
#Noticed that variable review_scores_rating and "Other" and "House" categories in prop_type_simplified are also insignificant. Dropping review_scores_rating.

After running model1, we can notice, that “review_scores_rating” is insignificant for our linear regression model as the p-value is bigger than 0.05.Therefore we will drop it. Our dummy variable “prop_type_simplified” turned out to be insignificant for Houses and Other property types. Anyway, we will keep the variable prop_type_simplified as some of it’s variables are important for our model. Our Adjusted R-squared in this model is only 2,25%. We will try to fit more variables in our model in order to increase the accuracy.

We will add as well an example of interpretation of our data in logarithmic lm model.

The coefficient interpretation of review_scores_rating in regards to total_price_4_days is as follows: If the review_scores_rating increases by one, the total_price_4_days decreases by 0,0003%.

The coefficient interpretation of prop_type_simplified in regards to total_price_4_days is as follows: In regards to a particular property type the total_price_4_days behaves as follows: - (property type: Apartment) : total_price_4_days just takes the “Intercept” variable and increases by 6,08%. - (property type: Condominium) : prop_type_simplifiedCondominium=1; total_price_4_days increases by 0.18%. - (property type: House): prop_type_simplifiedHouse=1; total_price_4_days decreases by 0,065%. - (property type: Loft): prop_type_simplifiedLoft=1; total_price_4_days increases by 0.301%. - (property type: Other): prop_type_simplifiedOther=1; total_price_4_days increases by 0.06%.

3.2 Model 2

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We will fit a regression model that includes all of the explanatory variables in model1 plus room_type.

model2 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              review_scores_rating+
              room_type, 
            data=munich_listings_region)
msummary(model2)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.149860   0.080197   76.68  < 2e-16 ***
## prop_type_simplifiedCondominium  0.115210   0.042271    2.73  0.00644 ** 
## prop_type_simplifiedHouse        0.043474   0.039020    1.11  0.26526    
## prop_type_simplifiedLoft         0.197614   0.058641    3.37  0.00076 ***
## prop_type_simplifiedOther        0.112799   0.037391    3.02  0.00257 ** 
## number_of_reviews               -0.001201   0.000149   -8.07  8.6e-16 ***
## review_scores_rating             0.000333   0.000842    0.40  0.69229    
## room_typeHotel room              0.309529   0.101913    3.04  0.00240 ** 
## room_typePrivate room           -0.377170   0.014942  -25.24  < 2e-16 ***
## room_typeShared room            -0.242816   0.065713   -3.70  0.00022 ***
## 
## Residual standard error: 0.553 on 6015 degrees of freedom
## Multiple R-squared:  0.116,  Adjusted R-squared:  0.115 
## F-statistic: 87.8 on 9 and 6015 DF,  p-value: <2e-16
car::vif(model2)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.20  4            1.02
## number_of_reviews    1.01  1            1.01
## review_scores_rating 1.02  1            1.01
## room_type            1.20  3            1.03

The room_type has increased our adjusted R-squared up to 0.13. The p-value for each room type is less than 0,05, thus the room type variable is important and we will keep it in our model.

3.3 Model 3

Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights?

model3 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              beds+
              accommodates, 
            data=munich_listings_region)
msummary(model3)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.567317   0.026569  209.54  < 2e-16 ***
## prop_type_simplifiedCondominium  0.081447   0.038663    2.11   0.0352 *  
## prop_type_simplifiedHouse       -0.103669   0.036075   -2.87   0.0041 ** 
## prop_type_simplifiedLoft         0.110756   0.053773    2.06   0.0395 *  
## prop_type_simplifiedOther        0.009137   0.034343    0.27   0.7902    
## number_of_reviews               -0.001494   0.000136  -10.96  < 2e-16 ***
## room_typeHotel room              0.538304   0.093237    5.77  8.2e-09 ***
## room_typePrivate room           -0.248967   0.014497  -17.17  < 2e-16 ***
## room_typeShared room            -0.286653   0.059913   -4.78  1.8e-06 ***
## bathrooms                        0.121782   0.023689    5.14  2.8e-07 ***
## bedrooms                         0.052133   0.013074    3.99  6.8e-05 ***
## beds                            -0.024632   0.008246   -2.99   0.0028 ** 
## accommodates                     0.146740   0.006864   21.38  < 2e-16 ***
## 
## Residual standard error: 0.504 on 6012 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.263 
## F-statistic:  180 on 12 and 6012 DF,  p-value: <2e-16
car::vif(model3)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.27  4            1.03
## number_of_reviews    1.02  1            1.01
## room_type            1.36  3            1.05
## bathrooms            1.22  1            1.11
## bedrooms             1.96  1            1.40
## beds                 2.49  1            1.58
## accommodates         2.52  1            1.59

All the variables in our model apart from “beds” variable ware significant as t-value of these variables is more than 2. In our further models we will keep “bedrooms”, “bathrooms” and “accommodates”, however we will drop the “beds”.

3.4 Model 4

Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

model4 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              accommodates+
              host_is_superhost, 
            data=munich_listings_region)
msummary(model4)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.579593   0.026354  211.72  < 2e-16 ***
## prop_type_simplifiedCondominium  0.076031   0.038645    1.97   0.0492 *  
## prop_type_simplifiedHouse       -0.107762   0.036075   -2.99   0.0028 ** 
## prop_type_simplifiedLoft         0.108861   0.053816    2.02   0.0431 *  
## prop_type_simplifiedOther        0.000282   0.034238    0.01   0.9934    
## number_of_reviews               -0.001474   0.000142  -10.38  < 2e-16 ***
## room_typeHotel room              0.540528   0.093317    5.79  7.3e-09 ***
## room_typePrivate room           -0.247167   0.014536  -17.00  < 2e-16 ***
## room_typeShared room            -0.291356   0.059946   -4.86  1.2e-06 ***
## bathrooms                        0.117383   0.023657    4.96  7.2e-07 ***
## bedrooms                         0.038997   0.012329    3.16   0.0016 ** 
## accommodates                     0.136191   0.005883   23.15  < 2e-16 ***
## host_is_superhostTRUE           -0.012966   0.018451   -0.70   0.4823    
## 
## Residual standard error: 0.505 on 6012 degrees of freedom
## Multiple R-squared:  0.263,  Adjusted R-squared:  0.262 
## F-statistic:  179 on 12 and 6012 DF,  p-value: <2e-16
car::vif(model4)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.25  4            1.03
## number_of_reviews    1.10  1            1.05
## room_type            1.37  3            1.05
## bathrooms            1.22  1            1.10
## bedrooms             1.74  1            1.32
## accommodates         1.85  1            1.36
## host_is_superhost    1.10  1            1.05

Superhosts do not command a pricing premium in Munich, therefore we will drop this variable in our further models. We can see that the VIF for bedrooms and accommodates has a bit higher VIF, however it is still not high enough to worry about it.

3.5 Model 5

Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights?

model5 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+bathrooms+
              bedrooms+accommodates+
              is_location_exact, 
            data=munich_listings_region)
msummary(model5)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.581507   0.028925  192.96  < 2e-16 ***
## prop_type_simplifiedCondominium  0.075406   0.038658    1.95   0.0512 .  
## prop_type_simplifiedHouse       -0.108313   0.036068   -3.00   0.0027 ** 
## prop_type_simplifiedLoft         0.108223   0.053811    2.01   0.0444 *  
## prop_type_simplifiedOther        0.000280   0.034239    0.01   0.9935    
## number_of_reviews               -0.001500   0.000137  -10.99  < 2e-16 ***
## room_typeHotel room              0.539701   0.093317    5.78  7.7e-09 ***
## room_typePrivate room           -0.247933   0.014506  -17.09  < 2e-16 ***
## room_typeShared room            -0.291236   0.059974   -4.86  1.2e-06 ***
## bathrooms                        0.117470   0.023677    4.96  7.2e-07 ***
## bedrooms                         0.039067   0.012329    3.17   0.0015 ** 
## accommodates                     0.136165   0.005883   23.15  < 2e-16 ***
## is_location_exactTRUE           -0.003974   0.016473   -0.24   0.8094    
## 
## Residual standard error: 0.505 on 6012 degrees of freedom
## Multiple R-squared:  0.263,  Adjusted R-squared:  0.262 
## F-statistic:  179 on 12 and 6012 DF,  p-value: <2e-16
car::vif(model5)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.25  4            1.03
## number_of_reviews    1.02  1            1.01
## room_type            1.36  3            1.05
## bathrooms            1.22  1            1.10
## bedrooms             1.74  1            1.32
## accommodates         1.85  1            1.36
## is_location_exact    1.01  1            1.00

The variable “is_location_exact” does not have a significant influence on the price of an Airbnb in Munich (p-value bigger than 0.05). Therefore, we will drop it.

3.6 Model 6

Now we will use a variable that we created - “region” that clusters all the neighbourhood to 5 zones and we will see how the location affects the price for Airbnb in our model.

model6 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              accommodates+
              region, 
            data=munich_listings_region)
msummary(model6)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.713665   0.027665  206.53  < 2e-16 ***
## prop_type_simplifiedCondominium  0.067621   0.037781    1.79  0.07353 .  
## prop_type_simplifiedHouse       -0.051486   0.035537   -1.45  0.14745    
## prop_type_simplifiedLoft         0.112969   0.052587    2.15  0.03174 *  
## prop_type_simplifiedOther        0.014181   0.033591    0.42  0.67291    
## number_of_reviews               -0.001600   0.000133  -11.99  < 2e-16 ***
## room_typeHotel room              0.487240   0.091290    5.34  9.8e-08 ***
## room_typePrivate room           -0.233688   0.014208  -16.45  < 2e-16 ***
## room_typeShared room            -0.266269   0.058622   -4.54  5.7e-06 ***
## bathrooms                        0.113342   0.023132    4.90  9.8e-07 ***
## bedrooms                         0.042178   0.012056    3.50  0.00047 ***
## accommodates                     0.136502   0.005750   23.74  < 2e-16 ***
## regionzone_2                    -0.134851   0.017317   -7.79  8.0e-15 ***
## regionzone_3                    -0.191581   0.018510  -10.35  < 2e-16 ***
## regionzone_4                    -0.194712   0.022643   -8.60  < 2e-16 ***
## regionzone_5                    -0.331023   0.020700  -15.99  < 2e-16 ***
## 
## Residual standard error: 0.493 on 6009 degrees of freedom
## Multiple R-squared:  0.297,  Adjusted R-squared:  0.295 
## F-statistic:  169 on 15 and 6009 DF,  p-value: <2e-16
car::vif(model6)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.28  4            1.03
## number_of_reviews    1.02  1            1.01
## room_type            1.37  3            1.05
## bathrooms            1.22  1            1.10
## bedrooms             1.75  1            1.32
## accommodates         1.85  1            1.36
## region               1.04  4            1.01

The region of Munich has a significant influence on the price. T-value of all the zone is way more than |2| and our adjusted R-squared went up - it suggests that model 6 better describes the real data than our previous models.

3.7 Model 7

What is the effect of cancellation_policy on price_4_nights, after we control for other variables?

model7 <-lm(log(total_price_4_days)~prop_type_simplified+
              number_of_reviews+
              room_type+
              bathrooms+
              bedrooms+
              accommodates+
              region+
              cancellation_policy, 
            data=munich_listings_region)
msummary(model7)
##                                                 Estimate Std. Error t value
## (Intercept)                                     5.684569   0.028818  197.26
## prop_type_simplifiedCondominium                 0.058108   0.037529    1.55
## prop_type_simplifiedHouse                      -0.058809   0.035306   -1.67
## prop_type_simplifiedLoft                        0.113839   0.052225    2.18
## prop_type_simplifiedOther                       0.030525   0.033403    0.91
## number_of_reviews                              -0.001666   0.000133  -12.52
## room_typeHotel room                             0.495942   0.090674    5.47
## room_typePrivate room                          -0.227447   0.014170  -16.05
## room_typeShared room                           -0.279824   0.058242   -4.80
## bathrooms                                       0.108761   0.022977    4.73
## bedrooms                                        0.044928   0.011978    3.75
## accommodates                                    0.130261   0.005750   22.66
## regionzone_2                                   -0.129482   0.017208   -7.52
## regionzone_3                                   -0.182387   0.018411   -9.91
## regionzone_4                                   -0.188113   0.022498   -8.36
## regionzone_5                                   -0.321764   0.020590  -15.63
## cancellation_policymoderate                     0.002518   0.015381    0.16
## cancellation_policystrict_14_with_grace_period  0.132608   0.015965    8.31
## cancellation_policysuper_strict_30              0.131256   0.491330    0.27
## cancellation_policysuper_strict_60              0.025796   0.490214    0.05
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## prop_type_simplifiedCondominium                 0.12160    
## prop_type_simplifiedHouse                       0.09582 .  
## prop_type_simplifiedLoft                        0.02931 *  
## prop_type_simplifiedOther                       0.36084    
## number_of_reviews                               < 2e-16 ***
## room_typeHotel room                             4.7e-08 ***
## room_typePrivate room                           < 2e-16 ***
## room_typeShared room                            1.6e-06 ***
## bathrooms                                       2.3e-06 ***
## bedrooms                                        0.00018 ***
## accommodates                                    < 2e-16 ***
## regionzone_2                                    6.1e-14 ***
## regionzone_3                                    < 2e-16 ***
## regionzone_4                                    < 2e-16 ***
## regionzone_5                                    < 2e-16 ***
## cancellation_policymoderate                     0.86997    
## cancellation_policystrict_14_with_grace_period  < 2e-16 ***
## cancellation_policysuper_strict_30              0.78937    
## cancellation_policysuper_strict_60              0.95804    
## 
## Residual standard error: 0.49 on 6005 degrees of freedom
## Multiple R-squared:  0.307,  Adjusted R-squared:  0.305 
## F-statistic:  140 on 19 and 6005 DF,  p-value: <2e-16
car::vif(model7)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.29  4            1.03
## number_of_reviews    1.03  1            1.01
## room_type            1.39  3            1.06
## bathrooms            1.22  1            1.11
## bedrooms             1.75  1            1.32
## accommodates         1.87  1            1.37
## region               1.05  4            1.01
## cancellation_policy  1.06  4            1.01

The cancellation policy of 14 days seems to have a significant impact on the price for 4 nights. This is why we will keep the variable “cancellation policy” in our model. Our Adjusted R-squared again went up by one percent. Let us keep trying adding more variables that may turn out significant for our model.

3.8 Final Model

Now we will create the model with numerous significant data that we checked to be relevant and significant to create our best fitting regression model.

model_wild_west<-lm(log10(total_price_4_days)~ #predicting total_price_4_days on variables below
                      prop_type_simplified+
                      number_of_reviews* #multiplied because of colinearity
                      reviews_per_month+
                      room_type*  # multiplied because of colinearity 
                      bedrooms+
                      bathrooms+
                      accommodates+
                      region+
                      cancellation_policy+
                      review_scores_value+
                      review_scores_cleanliness+
                      review_scores_checkin+
                      review_scores_location+
                      security_deposit+
                      rating_group+
                      instant_bookable+
                      availability_365+
                      availability_90+
                      maximum_nights+
                      minimum_nights+
                      is_elevator+
                      is_shampoo,
                    data=munich_listings_region)
msummary(model_wild_west) 
##                                                 Estimate Std. Error t value
## (Intercept)                                     2.35e+00   5.21e-02   45.09
## prop_type_simplifiedCondominium                 2.03e-02   1.54e-02    1.32
## prop_type_simplifiedHouse                      -2.52e-02   1.47e-02   -1.71
## prop_type_simplifiedLoft                        4.56e-02   2.14e-02    2.13
## prop_type_simplifiedOther                       1.87e-02   1.39e-02    1.34
## number_of_reviews                              -1.01e-03   1.28e-04   -7.86
## reviews_per_month                              -4.74e-02   3.97e-03  -11.94
## room_typeHotel room                             2.74e-01   6.36e-02    4.30
## room_typePrivate room                          -5.22e-04   1.07e-02   -0.05
## room_typeShared room                           -1.32e-01   2.41e-02   -5.48
## bedrooms                                        5.58e-02   5.56e-03   10.02
## bathrooms                                       4.55e-02   9.43e-03    4.82
## accommodates                                    5.00e-02   2.38e-03   21.01
## regionzone_2                                   -5.66e-02   7.07e-03   -8.00
## regionzone_3                                   -7.67e-02   7.61e-03  -10.08
## regionzone_4                                   -7.96e-02   9.27e-03   -8.58
## regionzone_5                                   -1.28e-01   8.61e-03  -14.88
## cancellation_policymoderate                     5.87e-03   6.37e-03    0.92
## cancellation_policystrict_14_with_grace_period  4.40e-02   6.69e-03    6.58
## cancellation_policysuper_strict_30             -4.29e-02   2.01e-01   -0.21
## cancellation_policysuper_strict_60             -6.90e-02   2.00e-01   -0.34
## review_scores_value                            -4.61e-02   3.47e-03  -13.28
## review_scores_cleanliness                       2.21e-02   3.30e-03    6.70
## review_scores_checkin                           1.27e-02   4.25e-03    2.99
## review_scores_location                          2.11e-02   4.22e-03    4.99
## security_deposit                                4.84e-05   7.27e-06    6.65
## rating_groupUnder 90                           -3.06e-02   9.32e-03   -3.28
## instant_bookableTRUE                            3.66e-02   5.77e-03    6.35
## availability_365                                1.39e-04   3.59e-05    3.86
## availability_90                                 7.09e-04   1.12e-04    6.35
## maximum_nights                                  3.34e-06   1.58e-06    2.12
## minimum_nights                                 -1.58e-02   3.26e-03   -4.84
## is_elevatorTRUE                                 1.32e-02   5.39e-03    2.44
## is_shampooTRUE                                  1.04e-02   5.38e-03    1.94
## number_of_reviews:reviews_per_month             1.87e-04   2.43e-05    7.69
## room_typeHotel room:bedrooms                   -1.04e-01   4.78e-02   -2.18
## room_typePrivate room:bedrooms                 -1.03e-01   8.32e-03  -12.40
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## prop_type_simplifiedCondominium                 0.18710    
## prop_type_simplifiedHouse                       0.08652 .  
## prop_type_simplifiedLoft                        0.03314 *  
## prop_type_simplifiedOther                       0.17900    
## number_of_reviews                               4.5e-15 ***
## reviews_per_month                               < 2e-16 ***
## room_typeHotel room                             1.7e-05 ***
## room_typePrivate room                           0.96121    
## room_typeShared room                            4.5e-08 ***
## bedrooms                                        < 2e-16 ***
## bathrooms                                       1.4e-06 ***
## accommodates                                    < 2e-16 ***
## regionzone_2                                    1.4e-15 ***
## regionzone_3                                    < 2e-16 ***
## regionzone_4                                    < 2e-16 ***
## regionzone_5                                    < 2e-16 ***
## cancellation_policymoderate                     0.35735    
## cancellation_policystrict_14_with_grace_period  5.0e-11 ***
## cancellation_policysuper_strict_30              0.83092    
## cancellation_policysuper_strict_60              0.73042    
## review_scores_value                             < 2e-16 ***
## review_scores_cleanliness                       2.3e-11 ***
## review_scores_checkin                           0.00280 ** 
## review_scores_location                          6.3e-07 ***
## security_deposit                                3.2e-11 ***
## rating_groupUnder 90                            0.00103 ** 
## instant_bookableTRUE                            2.3e-10 ***
## availability_365                                0.00011 ***
## availability_90                                 2.3e-10 ***
## maximum_nights                                  0.03416 *  
## minimum_nights                                  1.3e-06 ***
## is_elevatorTRUE                                 0.01473 *  
## is_shampooTRUE                                  0.05265 .  
## number_of_reviews:reviews_per_month             1.7e-14 ***
## room_typeHotel room:bedrooms                    0.02899 *  
## room_typePrivate room:bedrooms                  < 2e-16 ***
## 
## Residual standard error: 0.2 on 5988 degrees of freedom
## Multiple R-squared:  0.39,   Adjusted R-squared:  0.386 
## F-statistic:  106 on 36 and 5988 DF,  p-value: <2e-16
model_wild_west_colinear<-lm(log10(total_price_4_days)~ #predicting total_price_4_days on variables below
                      prop_type_simplified+
                      number_of_reviews+ #linearised for colinearity
                      reviews_per_month+
                      room_type+  # linearised for colinearity 
                      bedrooms+
                      bathrooms+
                      accommodates+
                      region+
                      cancellation_policy+
                      review_scores_value+
                      review_scores_cleanliness+
                      review_scores_checkin+
                      review_scores_location+
                      security_deposit+
                      rating_group+
                      instant_bookable+
                      availability_365+
                      availability_90+
                      maximum_nights+
                      minimum_nights+
                      is_elevator+
                      is_shampoo,
                    data=munich_listings_region)
car::vif(model_wild_west_colinear) # car VIF struggles with multiplied variables so a new unmultiplied model is used to check.
##                           GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified      1.40  4            1.04
## number_of_reviews         2.47  1            1.57
## reviews_per_month         2.61  1            1.62
## room_type                 1.62  3            1.08
## bedrooms                  1.78  1            1.34
## bathrooms                 1.23  1            1.11
## accommodates              1.93  1            1.39
## region                    1.12  4            1.01
## cancellation_policy       1.14  4            1.02
## review_scores_value       1.80  1            1.34
## review_scores_cleanliness 1.69  1            1.30
## review_scores_checkin     1.49  1            1.22
## review_scores_location    1.42  1            1.19
## security_deposit          1.07  1            1.03
## rating_group              1.62  1            1.27
## instant_bookable          1.09  1            1.04
## availability_365          2.53  1            1.59
## availability_90           2.41  1            1.55
## maximum_nights            1.02  1            1.01
## minimum_nights            1.15  1            1.07
## is_elevator               1.07  1            1.04
## is_shampoo                1.05  1            1.03

In our final model we tested variables from the previous models that were significant and tested much more variables that in our opinion could as well affect the total_price_4_days. We tested the variables connected to review scores - i.e. review_scores_value, review_scores_cleanliness, review_scores_checking_ review_scores_location etc. Only the ones mentioned turned out to be significant for the model. Afterwards we check for security_deposit, rating_group, instants_bookable and availability variables. Two of them (availability_60 and availability_30) turned out to be insignificant, so we decided to drop them. Thereafter, we added host_listings_count as we believe that the number of properties the host has may affect the standard, build some economies of scales perhaps and therefore affect somehow the price. This factor as well turned out to be significant. Later we tested maximum_nights and minimum_nights. In the next step we were testing whether particular types of amenities have any significant impact on the price. It turned out that two of them - elevator and shampoo (as they are always part of some welcome packs) are also significant for the price’s prediction. Moreover, we added two interaction variables - room_type&bedrooms and number_of_reviews&review_per_month as we believe there is much interaction happening between them. Our final model has adjusted R-squared at the level of 38.7% and a RSE at the level of 0.2. Checking the VIF throughout, we can see that the GVIF value is well below 5 and we can be assured that the colinearity is not affecting our model significantly.

4 Diagnostics

4.1 Checking Residuals

In the next step we will plot residuals, analyze their behaviour and check whether they are distributed within the norms. Afterwards we will compare all the models and compare how our models evolved.

#plotting residuals
autoplot(model_wild_west)+
  theme_bw()

# comparing significance of variables among model iterations
huxreg(model2, 
       model3, 
       model6, 
       model7, 
       model_wild_west)

(1)(2)(3)(4)(5)
(Intercept)6.150 ***5.567 ***5.714 ***5.685 ***2.348 ***
(0.080)   (0.027)   (0.028)   (0.029)   (0.052)   
prop_type_simplifiedCondominium0.115 ** 0.081 *  0.068    0.058    0.020    
(0.042)   (0.039)   (0.038)   (0.038)   (0.015)   
prop_type_simplifiedHouse0.043    -0.104 ** -0.051    -0.059    -0.025    
(0.039)   (0.036)   (0.036)   (0.035)   (0.015)   
prop_type_simplifiedLoft0.198 ***0.111 *  0.113 *  0.114 *  0.046 *  
(0.059)   (0.054)   (0.053)   (0.052)   (0.021)   
prop_type_simplifiedOther0.113 ** 0.009    0.014    0.031    0.019    
(0.037)   (0.034)   (0.034)   (0.033)   (0.014)   
number_of_reviews-0.001 ***-0.001 ***-0.002 ***-0.002 ***-0.001 ***
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
review_scores_rating0.000                                    
(0.001)                                   
room_typeHotel room0.310 ** 0.538 ***0.487 ***0.496 ***0.274 ***
(0.102)   (0.093)   (0.091)   (0.091)   (0.064)   
room_typePrivate room-0.377 ***-0.249 ***-0.234 ***-0.227 ***-0.001    
(0.015)   (0.014)   (0.014)   (0.014)   (0.011)   
room_typeShared room-0.243 ***-0.287 ***-0.266 ***-0.280 ***-0.132 ***
(0.066)   (0.060)   (0.059)   (0.058)   (0.024)   
bathrooms        0.122 ***0.113 ***0.109 ***0.045 ***
        (0.024)   (0.023)   (0.023)   (0.009)   
bedrooms        0.052 ***0.042 ***0.045 ***0.056 ***
        (0.013)   (0.012)   (0.012)   (0.006)   
beds        -0.025 **                         
        (0.008)                           
accommodates        0.147 ***0.137 ***0.130 ***0.050 ***
        (0.007)   (0.006)   (0.006)   (0.002)   
regionzone_2                -0.135 ***-0.129 ***-0.057 ***
                (0.017)   (0.017)   (0.007)   
regionzone_3                -0.192 ***-0.182 ***-0.077 ***
                (0.019)   (0.018)   (0.008)   
regionzone_4                -0.195 ***-0.188 ***-0.080 ***
                (0.023)   (0.022)   (0.009)   
regionzone_5                -0.331 ***-0.322 ***-0.128 ***
                (0.021)   (0.021)   (0.009)   
cancellation_policymoderate                        0.003    0.006    
                        (0.015)   (0.006)   
cancellation_policystrict_14_with_grace_period                        0.133 ***0.044 ***
                        (0.016)   (0.007)   
cancellation_policysuper_strict_30                        0.131    -0.043    
                        (0.491)   (0.201)   
cancellation_policysuper_strict_60                        0.026    -0.069    
                        (0.490)   (0.200)   
reviews_per_month                                -0.047 ***
                                (0.004)   
review_scores_value                                -0.046 ***
                                (0.003)   
review_scores_cleanliness                                0.022 ***
                                (0.003)   
review_scores_checkin                                0.013 ** 
                                (0.004)   
review_scores_location                                0.021 ***
                                (0.004)   
security_deposit                                0.000 ***
                                (0.000)   
rating_groupUnder 90                                -0.031 ** 
                                (0.009)   
instant_bookableTRUE                                0.037 ***
                                (0.006)   
availability_365                                0.000 ***
                                (0.000)   
availability_90                                0.001 ***
                                (0.000)   
maximum_nights                                0.000 *  
                                (0.000)   
minimum_nights                                -0.016 ***
                                (0.003)   
is_elevatorTRUE                                0.013 *  
                                (0.005)   
is_shampooTRUE                                0.010    
                                (0.005)   
number_of_reviews:reviews_per_month                                0.000 ***
                                (0.000)   
room_typeHotel room:bedrooms                                -0.104 *  
                                (0.048)   
room_typePrivate room:bedrooms                                -0.103 ***
                                (0.008)   
room_typeShared room:bedrooms                                        
                                        
N6025        6025        6025        6025        6025        
R20.116    0.264    0.297    0.307    0.390    
logLik-4973.056    -4419.501    -4283.651    -4238.792    1169.906    
AIC9968.112    8867.003    8601.301    8519.583    -2263.812    
*** p < 0.001; ** p < 0.01; * p < 0.05.
The residuals behave in an appropriate way, hence we assume that the model is correct. Though there is a slight gradient in Scale-Location, and slight tendency in Residuals vs Fitted. The Leverage tends around the mean and the normal Q-Q is linear for the most part. These slight issues are due to the quality of the data scraper.

From the table comparing all the models we can spot, that our R-squared went up through out the process of finding the best solution. We can as well spot which variables were added and dropped at which stages.

##Model applyinh and predicting the outcome

Now, we will find a price of the Airbnbs that are apartment with a private room, have at least 10 reviews, and an average rating of at least 90.

We are using our logarithmic model log(total_price_4_days) in the predict function since our regression is based in the log(total_price_4_days). First, we will create a new table that we will filter according to the conditions above. In the next step, we will anti-log our model_wild_west. At the end, we will predict the prices for our filtered accommodations and we will create for them the Confidence Intervals. We will do it in two ways in order to compare our scores.

munich_listings_predict<- munich_listings_region %>%
  mutate(price=log(total_price_4_days)) %>% #converting to log form for prediction
  filter(room_type=="Private room" &
           number_of_reviews>=10 & 
           rating_group=="Over 90")


predict_df<-10^predict(model_wild_west, # converting from log form to nominal
                    newdata = munich_listings_predict, 
                    interval= "confidence")
#sanity check
summary(predict_df)
##       fit           lwr           upr     
##  Min.   :138   Min.   :119   Min.   :160  
##  1st Qu.:255   1st Qu.:240   1st Qu.:271  
##  Median :297   Median :280   Median :315  
##  Mean   :312   Mean   :293   Mean   :333  
##  3rd Qu.:352   3rd Qu.:332   3rd Qu.:371  
##  Max.   :788   Max.   :643   Max.   :993
#using broom augment
model_prediction <- broom::augment(model_wild_west, 
                                   newdata= munich_listings_predict)
model_prediction <- model_prediction %>% 
  mutate(lower_95=10^.fitted-1.96*abs(10^(.resid)),#creating 95% confidence interval
         upper_95=10^.fitted+1.96*abs(10^(.resid))) %>% 
  select(.fitted,
         lower_95,
         upper_95, 
         total_price_4_days) %>% 
  mutate(.fitted=10^.fitted)
#sanity check
summary(model_prediction)
##     .fitted       lower_95      upper_95   total_price_4_days
##  Min.   :138   Min.   :135   Min.   :140   Min.   :  92      
##  1st Qu.:255   1st Qu.:253   1st Qu.:257   1st Qu.: 215      
##  Median :297   Median :295   Median :299   Median : 275      
##  Mean   :312   Mean   :310   Mean   :314   Mean   : 324      
##  3rd Qu.:352   3rd Qu.:350   3rd Qu.:353   3rd Qu.: 376      
##  Max.   :788   Max.   :788   Max.   :789   Max.   :4036

Using the predict and augment function we observe a mean price of around 315, which is close to the actual total mean price of 327. We also see that the 1st and 3rd quartiles for both of the prediction methods all line up. Differences appear in the lower and upper confidence level boundaries between the two functions, where the predict function’s interval actually captures the true mean, the augment misses it by €10. Despite this we can get a sense of confidence for our linear regression’s accuracy due to the tight spread and capturing of the true mean. The next step is conducting a sanity check by checking the RMSE of our model.

##Data Training and RMSE In the next step we will split our data into two parts. We will train one part and later test another one. In the next step we will compare the results.

set.seed(1234)
train_test_split <- initial_split(munich_listings_predict, prop=0.7) # splitting dataset
munich_train<- training(train_test_split)
munich_test<- testing(train_test_split)

rmse_train <- munich_train %>%  #training portion for RMSE
  mutate(predictions=predict(model_wild_west,.)) %>% 
  summarise(sqrt(sum(predictions-log(total_price_4_days))**2/n())) %>% 
  pull()
rmse_train
## [1] 81.2
rmse_test <- munich_test %>% 
  mutate(predictions=predict(model_wild_west,.)) %>% 
  summarise(sqrt(sum(predictions-log(total_price_4_days))**2/n())) %>% 
  pull()
rmse_test
## [1] 52.8

We can see that the RMSE is an order of magnitude below our prices, which confirms that though our R^2 is low, our accuracy is very high.

Thank you for your interest in our study project. We hope you found it interesting.

Authors Magdalena Cloppenburg, Yichun Hou, Derek Leung, Malay Memani, Samy Mohamad, Agnieszka Prawda.